Client Report - Can You Predict That?

Course DS 250

Author

[STUDENT NAME]

Show the code
import pandas as pd 
import numpy as np
from lets_plot import *
# add the additional libraries you need to import for ML here

LetsPlot.setup_html(isolated_frame=True)
Show the code
# Learn more about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

# Include and execute your code here

# import your data here using pandas and the URL
import pandas as pd

ml_url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"
neigh_url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv"
info_url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_denver/dwellings_denver.csv"

ml = pd.read_csv(ml_url)
neigh = pd.read_csv(neigh_url)
info = pd.read_csv(info_url)

homes = ml.merge(neigh, on="parcel", how="left")
homes.head()
parcel abstrprd livearea finbsmnt basement yrbuilt totunits stories nocars numbdrm ... nbhd_802 nbhd_803 nbhd_804 nbhd_805 nbhd_901 nbhd_902 nbhd_903 nbhd_904 nbhd_905 nbhd_906
0 00102-08-065-065 1130 1346 0 0 2004 1 2 2 2 ... 0 0 0 0 0 0 0 0 0 0
1 00102-08-073-073 1130 1249 0 0 2005 1 1 1 2 ... 0 0 0 0 0 0 0 0 0 0
2 00102-08-078-078 1130 1346 0 0 2005 1 2 1 2 ... 0 0 0 0 0 0 0 0 0 0
3 00102-08-081-081 1130 1146 0 0 2005 1 1 0 2 ... 0 0 0 0 0 0 0 0 0 0
4 00102-08-086-086 1130 1249 0 0 2005 1 1 1 2 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 324 columns

Elevator pitch

A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)

A client has requested this analysis, and this is your one shot at what you would say to your boss on a two-minute elevator ride before they take your report and hand it to the client.

QUESTION|TASK 1

Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.

These charts show relationships between the home variables and before1980, and many of the variables involved are binary 0/1 indicators. One way to improve the inputs to the algorithm would be to use raw totals instead of just ratios or percentages. Also, some of the categorical variables could be grouped to reduce the number of categories and make it easier for the model to learn patterns. The yrbuilt scatter in particular separates the two classes perfectly, which is expected since before1980 is defined directly from the year built.
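As a sketch of the grouping idea, one-hot dummy columns like the nbhd_* indicators could be collapsed back into a single categorical column, with counts (totals) computed per group. This is a minimal illustration on a hypothetical miniature frame, not the real homes data:

```python
import pandas as pd

# Hypothetical stand-in for the one-hot neighborhood columns in `homes`
df = pd.DataFrame({
    "nbhd_101": [1, 0, 0, 1],
    "nbhd_102": [0, 1, 0, 0],
    "nbhd_103": [0, 0, 1, 0],
})

# Collapse the dummy columns back into one categorical column,
# reducing many 0/1 features to a single grouped variable
nbhd_cols = [c for c in df.columns if c.startswith("nbhd_")]
df["neighborhood"] = df[nbhd_cols].idxmax(axis=1).str.replace("nbhd_", "")

# Counts (totals) per group instead of per-row 0/1 indicators
counts = df["neighborhood"].value_counts()
```

The same pattern would apply to any family of dummy columns that shares a prefix.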

Show the code
# Include and execute your code here

# Initialize Lets-Plot
LetsPlot.setup_html(isolated_frame=True)

# Select a few features to visualize
numeric_features = ['livearea', 'yrbuilt', 'stories']
categorical_features = ['gartype_Att', 'condition_Good']

# 1. Scatter
p1 = ggplot(homes, aes(x='yrbuilt', y='before1980')) + \
    geom_jitter(width=0, height=0.02, alpha=0.3) + \
    geom_smooth(method="loess", se=True, color='red') + \
    ggtitle("Year Built vs Before1980 (LOESS Trend)") + \
    xlab("Year Built") + ylab("Before1980")

# 2. Boxplot
p2 = ggplot(homes, aes(x='before1980', y='livearea')) + \
    geom_boxplot() + \
    ggtitle("Living Area vs Before1980") + \
    xlab("Before1980") + ylab("Living Area (sq ft)")

# 3. Bar Chart
homes['before1980_str'] = homes['before1980'].astype(str)

p3 = ggplot(homes, aes(x='gartype_Att', fill='before1980_str')) + \
    geom_bar(position="dodge") + \
    ggtitle("Garage Type (Attached) vs Before1980") + \
    xlab("Attached Garage") + ylab("Count") + \
    scale_fill_manual(values=["#1f77b4","#ff7f0e"], name="Before1980")

# show plots
p1.show()
p2.show()
p3.show()

QUESTION|TASK 2

Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.

I built a Random Forest classifier to predict whether a house was built before 1980. I chose this algorithm because it handles both numeric and categorical features well and is relatively easy to interpret. The model achieved 100 percent accuracy, but only because yrbuilt was left in the feature set: the target before1980 is derived directly from the year built, so the model can simply read the answer off that one column. This is data leakage rather than genuine predictive skill, and the second code block below confirms it by reaching the same perfect score using yrbuilt alone. A meaningful accuracy estimate would require dropping yrbuilt before training.

Show the code
# Include and execute your code here

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Prepare features and target
X = homes.drop(columns=['parcel', 'before1980'])  # drop non-predictive IDs and target
y = homes['before1980']

# Convert all boolean/categorical columns to numeric
X = pd.get_dummies(X, drop_first=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale numeric features
scaler = StandardScaler()
X_train[X_train.select_dtypes(include=['float64','int64']).columns] = scaler.fit_transform(
    X_train.select_dtypes(include=['float64','int64'])
)
X_test[X_test.select_dtypes(include=['float64','int64']).columns] = scaler.transform(
    X_test.select_dtypes(include=['float64','int64'])
)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Random Forest with basic tuning
rf = RandomForestClassifier(
    n_estimators=200,       
    max_depth=15,           
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42
)

rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred))

X_simple = homes[['yrbuilt']]
y = homes['before1980']

X_train, X_test, y_train, y_test = train_test_split(X_simple, y, test_size=0.2, random_state=42, stratify=y)

rf_simple = RandomForestClassifier(n_estimators=100, random_state=42)
rf_simple.fit(X_train, y_train)
print("Accuracy using only yrbuilt:", rf_simple.score(X_test, y_test))
Accuracy: 1.000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2131
           1       1.00      1.00      1.00      3462

    accuracy                           1.00      5593
   macro avg       1.00      1.00      1.00      5593
weighted avg       1.00      1.00      1.00      5593

Accuracy using only yrbuilt: 1.0
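To illustrate the leakage point, here is a minimal sketch on synthetic data (not the real homes frame): when the target is derived directly from one feature, accuracy is perfect, and dropping that feature gives an honest estimate from whatever signal remains. The column names mirror the real data, but the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical miniature version of the leakage in the homes data:
# the target is computed directly from one feature (yrbuilt)
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "yrbuilt": rng.integers(1900, 2010, size=500),
    "livearea": rng.integers(600, 4000, size=500),
})
y = (X["yrbuilt"] < 1980).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# With the leaking feature present, accuracy is (near) perfect
with_leak = RandomForestClassifier(random_state=42).fit(X_tr, y_tr).score(X_te, y_te)

# Dropping it yields an honest estimate from the remaining features
no_leak = RandomForestClassifier(random_state=42).fit(
    X_tr.drop(columns="yrbuilt"), y_tr
).score(X_te.drop(columns="yrbuilt"), y_te)
```

In the real data, the analogous fix is to drop yrbuilt (along with parcel) from X before training.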

QUESTION|TASK 3

Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.

This analysis shows that the most important feature for predicting whether a house was built before 1980 is the year it was built. This makes sense since the target variable is directly derived from this feature. Other features such as living area and garage type have much lower importance, indicating they contribute less to the model’s predictions.

Show the code
# Include and execute your code here
# =============================
# Feature importance plot (lets_plot)
# =============================

from lets_plot import *
from sklearn.ensemble import RandomForestClassifier

# Ensure Lets-Plot is initialized
LetsPlot.setup_html()

# --- Get feature importances (train a fallback model if `rf` is missing) ---
try:
    importances = rf.feature_importances_
    feature_names = X.columns
except NameError:
    # If rf was never trained, fit a quick Random Forest on X/y as a fallback
    print("Warning: 'rf' not found. Training a fallback RandomForest for importances.")
    rf_tmp = RandomForestClassifier(n_estimators=200, random_state=42)
    try:
        rf_tmp.fit(X, y)
    except NameError as e:
        raise RuntimeError("Make sure X (features) and y (target) are defined before computing importances.") from e
    importances = rf_tmp.feature_importances_
    feature_names = X.columns

# Build DataFrame of importances
fi = pd.DataFrame({
    "feature": feature_names,
    "importance": importances
})

# Sort descending and keep top 20
fi = fi.sort_values("importance", ascending=False).head(20).reset_index(drop=True)

# Make the feature column a categorical with the exact display order (lets_plot respects category order)
fi['feature'] = pd.Categorical(fi['feature'], categories=fi['feature'].tolist(), ordered=True)

# Plot with lets_plot (use geom_bar with stat='identity')
plot = (
    ggplot(fi, aes(x='feature', y='importance')) +
    geom_bar(stat='identity', fill="#4C72B0") +
    coord_flip() +
    labs(
        title="Top 20 Feature Importances (Random Forest)",
        x="Feature",
        y="Importance"
    ) +
    theme_minimal()
)

plot
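As a complementary check on the impurity-based importances above, permutation importance measures how much test accuracy drops when a single column is shuffled; it is less biased toward high-cardinality features. A minimal sketch on synthetic data (hypothetical column names, not the homes features):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one informative feature, one pure-noise feature
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = (X["informative"] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Permutation importance: mean accuracy drop when each column is shuffled
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
```

The informative column should show a large accuracy drop, while the noise column stays near zero.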

QUESTION|TASK 4

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.

On the held-out test set, the model reached 1.000 accuracy and 1.000 precision. Accuracy is the proportion of all predictions that were correct, so every test home was classified correctly. Precision is the share of homes predicted as built before 1980 that actually were, so the model made no false-positive calls. Both metrics are perfect here, which again reflects the leakage from yrbuilt rather than real-world performance; recall (the share of true pre-1980 homes the model found) would be a natural third metric to report.

Show the code
# Include and execute your code here
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score

# Split data (keep all columns consistent)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale numeric features
num_cols = X_train.select_dtypes(include=['float64','int64']).columns
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# Train Random Forest
rf = RandomForestClassifier(n_estimators=200, max_depth=15, min_samples_split=5, min_samples_leaf=2, random_state=42)
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)

print("=== Model Quality Metrics ===")
print(f"Accuracy : {accuracy:.3f}  (proportion of correct predictions)")
print(f"Precision: {precision:.3f}  (correctness among predicted positives)")
=== Model Quality Metrics ===
Accuracy : 1.000  (proportion of correct predictions)
Precision: 1.000  (correctness among predicted positives)
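If a third metric is wanted, recall, F1, and the confusion matrix are natural additions. A minimal sketch on hypothetical labels (standing in for y_test and y_pred from the model above, which would be all-correct here):

```python
import numpy as np
from sklearn.metrics import recall_score, f1_score, confusion_matrix

# Hypothetical true labels and predictions for illustration
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])

# Recall: share of actual positives the model found (4 of 5 here)
rec = recall_score(y_true, y_pred)
# F1: harmonic mean of precision and recall, balancing the two
f1 = f1_score(y_true, y_pred)
# Confusion matrix: rows are actual class, columns are predicted class
cm = confusion_matrix(y_true, y_pred)
```

The diagonal of the confusion matrix counts correct predictions; the off-diagonal cells count false positives and false negatives.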

STRETCH QUESTION|TASK 1

Repeat the classification model using 3 different algorithms. Display their Feature Importance and Decision Matrix. Explain the differences between the models and which one you would recommend to the Client.

type your results and analysis here

Show the code
# Include and execute your code here
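As a starting sketch for this task, one could fit three common classifiers and collect their accuracy and confusion matrices in a loop. This runs on synthetic stand-in data, and the choice of Random Forest, Gradient Boosting, and Logistic Regression is an assumption, not a result from the real homes data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the homes features and target
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "livearea": rng.normal(1500, 400, size=400),
    "stories": rng.integers(1, 4, size=400),
})
y = (X["livearea"] > 1500).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = {
    "random_forest": RandomForestClassifier(random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Fit each model and record its accuracy and confusion matrix
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "confusion_matrix": confusion_matrix(y_te, pred),
    }
```

Note that Logistic Regression exposes coefficients rather than `feature_importances_`, so the importance plots would need slightly different code per model.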

STRETCH QUESTION|TASK 2

Join the dwellings_neighborhoods_ml.csv data to the dwellings_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and if this changes the model you recommend to the Client.

type your results and analysis here

Show the code
# Include and execute your code here
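A minimal sketch of the join itself, on hypothetical miniature frames; the `indicator` flag in pandas merge makes it easy to verify how many parcels actually matched before refitting the models:

```python
import pandas as pd

# Hypothetical miniature stand-ins for dwellings_ml and dwellings_neighborhoods_ml
ml = pd.DataFrame({"parcel": ["A", "B", "C"], "livearea": [1200, 1500, 900]})
neigh = pd.DataFrame({"parcel": ["A", "B", "C"], "nbhd_101": [1, 0, 1]})

# Join on the shared parcel key; _merge flags rows that failed to match
joined = ml.merge(neigh, on="parcel", how="left", indicator=True)
match_rate = (joined["_merge"] == "both").mean()
```

A match rate below 1.0 would mean some parcels have no neighborhood record and their nbhd_* columns would be NaN after the join.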

STRETCH QUESTION|TASK 3

Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.

type your results and analysis here

Show the code
# Include and execute your code here
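One plausible starting point (a sketch on synthetic data, not the real homes frame) is a Random Forest regressor evaluated with mean absolute error, which reads directly in years, and R-squared, the share of variance explained:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: predict a year-like target from house features
rng = np.random.default_rng(7)
X = pd.DataFrame({
    "livearea": rng.normal(1500, 400, size=400),
    "basement": rng.integers(0, 1200, size=400),
})
# Synthetic year built, loosely related to living area plus noise
y = 1950 + 0.02 * X["livearea"] + rng.normal(0, 10, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)
reg = RandomForestRegressor(random_state=7).fit(X_tr, y_tr)
pred = reg.predict(X_te)

# MAE: average error in years; R^2: variance explained (1.0 is perfect)
mae = mean_absolute_error(y_te, pred)
r2 = r2_score(y_te, pred)
```

A model would be "good" if its MAE is small relative to the spread of build years and its R-squared is well above zero on held-out data.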